# hinge_loss
Hinge loss is a margin-based loss for classification. It’s the standard convex surrogate behind the (soft-margin) Support Vector Machine (SVM).
This notebook:

- defines binary and multiclass hinge loss with consistent notation
- builds intuition with Plotly plots
- implements the loss (and a useful subgradient) from scratch in NumPy
- uses hinge loss to optimize a simple linear classifier (primal SVM-style)
## Quick import
from sklearn.metrics import hinge_loss
**Important:** `hinge_loss` expects decision scores (real-valued margins), not probabilities.
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from dataclasses import dataclass
from sklearn.datasets import make_blobs
from sklearn.metrics import hinge_loss as skl_hinge_loss
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
## 1) Binary hinge loss (definition)
Binary classification with a **real-valued score**:
- label: $y \in \{-1, +1\}$
- model score: $s = f(x) \in \mathbb{R}$
- prediction: $\hat{y} = \mathrm{sign}(s)$
The key quantity is the **(signed) margin**:
$$
m = y\,s.
$$
- If $m > 0$, the example is classified correctly.
- Larger $m$ means “more confident” (further from the decision boundary).
The **hinge loss** is:
$$
\ell(y, s) = \max(0, 1 - y s) = \max(0, 1 - m).
$$
Average hinge loss over a dataset:
$$
L = \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i s_i).
$$
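To make the formula concrete, here is a tiny worked example (a minimal sketch; the labels and scores below are made up purely for illustration):

y = np.array([+1.0, +1.0, -1.0, -1.0])    # labels in {-1, +1}
s = np.array([2.3, 0.4, 0.5, -1.7])       # raw decision scores
m = y * s                                  # margins: [2.3, 0.4, -0.5, 1.7]
per_example = np.maximum(0.0, 1.0 - m)     # [0.0, 0.6, 1.5, 0.0]
print(per_example, per_example.mean())     # mean = 0.525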
### Relationship to 0–1 loss
The 0–1 loss is $\mathbb{1}[m \le 0]$ (wrong sign).
Hinge loss is a **convex upper bound**:
$$
\mathbb{1}[m \le 0] \;\le\; \max(0, 1 - m).
$$
So minimizing hinge loss tends to reduce classification errors while also encouraging a **margin** ($m \ge 1$ gives zero loss).
m = np.linspace(-3, 3, 600)
loss_01 = (m <= 0).astype(float)
loss_hinge = np.maximum(0.0, 1.0 - m)
loss_sq_hinge = np.maximum(0.0, 1.0 - m) ** 2
fig = go.Figure()
fig.add_trace(go.Scatter(x=m, y=loss_01, name="0-1 loss 𝟙[m≤0]", line=dict(dash="dash")))
fig.add_trace(go.Scatter(x=m, y=loss_hinge, name="hinge max(0, 1-m)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=m, y=loss_sq_hinge, name="squared hinge (variant)", line=dict(dash="dot")))
fig.add_vline(x=0, line_dash="dot", line_color="gray")
fig.add_vline(x=1, line_dash="dot", line_color="gray")
fig.update_layout(
title="Loss as a function of margin m = y·score",
xaxis_title="margin m",
yaxis_title="loss",
legend_title="",
)
fig.show()
## 2) Intuition: which points are penalized?
Because $\ell(m)=\max(0, 1-m)$:
- **Misclassified** points ($m \le 0$) get loss $\ge 1$.
- **Correct but too close** to the boundary ($0 < m < 1$) still get *some* loss.
- **Confident** points ($m \ge 1$) get **zero** loss.
This is why hinge-based models often end up depending heavily on a subset of points (those with $m \le 1$), commonly called **support vectors** in the SVM context.
m_samples = np.linspace(-2.5, 2.5, 60)
loss_samples = np.maximum(0.0, 1.0 - m_samples)
category = np.where(
m_samples <= 0,
"misclassified (m ≤ 0)",
np.where(m_samples < 1, "correct but within margin (0 < m < 1)", "confident (m ≥ 1)"),
)
fig = px.scatter(
x=m_samples,
y=loss_samples,
color=category,
title="Only points with margin m < 1 contribute to hinge loss",
)
fig.add_vline(x=0, line_dash="dot", line_color="gray")
fig.add_vline(x=1, line_dash="dot", line_color="gray")
fig.update_layout(xaxis_title="margin m", yaxis_title="hinge loss")
fig.show()
## 3) Multiclass hinge loss (Crammer–Singer)
For $K$ classes, assume a score vector:
$$
s(x) \in \mathbb{R}^K, \quad s_k(x) = \text{score for class } k.
$$
If the true class is $y \in \{0,\dots,K-1\}$, the multiclass hinge loss is:
$$
\ell(y, s) = \max\big(0, 1 + \max_{j \ne y} s_j - s_y\big).
$$
It enforces a **margin** between the true class score and the best competing score:
$$
s_y \ge \max_{j \ne y} s_j + 1 \quad \Rightarrow \quad \ell = 0.
$$
This is the formulation used by `sklearn.metrics.hinge_loss` when `pred_decision` is shaped `(n_samples, n_classes)`.
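Before the full NumPy implementation below, a minimal worked example of this formula on a single made-up score vector:

scores_one = np.array([1.2, 0.3, 0.9])    # hypothetical scores for K = 3 classes
y_one = 2                                  # true class index
s_true = scores_one[y_one]                            # 0.9
s_best_other = np.max(np.delete(scores_one, y_one))   # 1.2
print(max(0.0, 1.0 + s_best_other - s_true))          # 1 + 1.2 - 0.9 = 1.3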
def _as_1d_float(x: np.ndarray) -> np.ndarray:
x = np.asarray(x, dtype=float)
if x.ndim != 1:
raise ValueError(f"Expected a 1D array, got shape={x.shape}")
return x
def binary_hinge_loss(
y_true: np.ndarray,
scores: np.ndarray,
*,
margin: float = 1.0,
sample_weight: np.ndarray | None = None,
reduction: str = "mean",
) -> float:
"""Binary hinge loss: mean_i max(0, margin - y_i * score_i).
Accepts labels in {0,1} or {-1,+1}. `scores` are raw decision scores.
"""
y = _as_1d_float(y_true)
s = _as_1d_float(scores)
if y.shape[0] != s.shape[0]:
raise ValueError(
f"y_true and scores must match in length, got {y.shape[0]} vs {s.shape[0]}"
)
uniques = set(np.unique(y).tolist())
if uniques.issubset({0.0, 1.0}):
y = np.where(y == 0.0, -1.0, 1.0)
elif not uniques.issubset({-1.0, 1.0}):
raise ValueError(
f"For binary hinge loss, y_true must be in {{0,1}} or {{-1,1}}, got {sorted(uniques)}"
)
loss = np.maximum(0.0, margin - y * s)
if sample_weight is not None:
w = _as_1d_float(sample_weight)
if w.shape[0] != loss.shape[0]:
raise ValueError("sample_weight must have the same length as y_true")
if reduction == "mean":
return float(np.sum(w * loss) / np.sum(w))
if reduction == "sum":
return float(np.sum(w * loss))
raise ValueError("reduction must be 'mean' or 'sum'")
if reduction == "mean":
return float(np.mean(loss))
if reduction == "sum":
return float(np.sum(loss))
raise ValueError("reduction must be 'mean' or 'sum'")
def multiclass_hinge_loss(
y_true: np.ndarray,
scores: np.ndarray,
*,
margin: float = 1.0,
sample_weight: np.ndarray | None = None,
reduction: str = "mean",
) -> float:
"""Multiclass hinge loss (Crammer–Singer): mean_i max(0, margin + max_{j!=y} s_ij - s_i,y).
`y_true` are integer class labels in [0, K-1]. `scores` has shape (n, K).
"""
y = np.asarray(y_true)
s = np.asarray(scores, dtype=float)
if y.ndim != 1:
raise ValueError(f"y_true must be 1D, got shape={y.shape}")
if s.ndim != 2:
raise ValueError(f"scores must be 2D, got shape={s.shape}")
n, k = s.shape
if y.shape[0] != n:
raise ValueError("y_true and scores must match in n_samples")
y = y.astype(int)
if y.min() < 0 or y.max() >= k:
raise ValueError(f"y_true values must be in [0, {k-1}]")
true_scores = s[np.arange(n), y]
s_other = s.copy()
s_other[np.arange(n), y] = -np.inf
max_other = np.max(s_other, axis=1)
loss = np.maximum(0.0, margin + max_other - true_scores)
if sample_weight is not None:
w = _as_1d_float(sample_weight)
if w.shape[0] != loss.shape[0]:
raise ValueError("sample_weight must have the same length as y_true")
if reduction == "mean":
return float(np.sum(w * loss) / np.sum(w))
if reduction == "sum":
return float(np.sum(w * loss))
raise ValueError("reduction must be 'mean' or 'sum'")
if reduction == "mean":
return float(np.mean(loss))
if reduction == "sum":
return float(np.sum(loss))
raise ValueError("reduction must be 'mean' or 'sum'")
# --- Binary: compare against sklearn.metrics.hinge_loss ---
y_true_01 = np.array([0, 1, 0, 1])
score = np.array([-0.2, 0.5, 0.3, 1.2])
skl = skl_hinge_loss(y_true_01, score)
ours = binary_hinge_loss(y_true_01, score)
print("binary | sklearn:", skl)
print("binary | numpy :", ours)
# --- Multiclass: compare against sklearn.metrics.hinge_loss ---
y_true_mc = np.array([0, 1, 2])
scores_mc = np.array(
[
[2.0, 0.0, -1.0],
[0.1, 0.2, 0.0],
[-1.0, 0.0, 3.0],
]
)
skl_mc = skl_hinge_loss(y_true_mc, scores_mc)
ours_mc = multiclass_hinge_loss(y_true_mc, scores_mc)
print("multiclass | sklearn:", skl_mc)
print("multiclass | numpy :", ours_mc)
binary | sklearn: 0.65
binary | numpy : 0.65
multiclass | sklearn: 0.3
multiclass | numpy : 0.30000000000000004
## 4) Using hinge loss to optimize a linear classifier (soft-margin SVM style)
A common choice is a linear score function:
$$
s_i = f(x_i) = w^T x_i + b.
$$
A soft-margin (primal) SVM objective is:
$$
J(w,b) = \frac{1}{2}\lVert w \rVert^2 + C\,\frac{1}{n}\sum_{i=1}^n \max\big(0, 1 - y_i(w^T x_i + b)\big).
$$
- The $\tfrac12\lVert w \rVert^2$ term is **L2 regularization** (prefers a wider margin).
- $C>0$ trades off margin size vs hinge penalties.
### Subgradient (what we need for gradient descent)
The hinge part is **not differentiable** at $m_i = 1$.
But it’s convex, so we can use a **subgradient**.
Let $m_i = y_i(w^T x_i + b)$ and define the “violators”:
$$
\mathcal{V} = \{i : m_i < 1\}.
$$
A convenient subgradient is:
$$
\nabla_w J = w - \frac{C}{n}\sum_{i\in\mathcal{V}} y_i x_i,
\qquad
\nabla_b J = - \frac{C}{n}\sum_{i\in\mathcal{V}} y_i.
$$
We’ll implement full-batch subgradient descent below.
@dataclass
class LinearSVMHistory:
objective: list[float]
mean_hinge: list[float]
accuracy: list[float]
def linear_svm_objective(
w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray, *, C: float = 1.0
) -> tuple[float, float]:
scores = X @ w + b
hinge = np.maximum(0.0, 1.0 - y * scores)
obj = 0.5 * float(w @ w) + C * float(np.mean(hinge))
return obj, float(np.mean(hinge))
def linear_svm_subgrad(
w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray, *, C: float = 1.0
) -> tuple[np.ndarray, float]:
n = X.shape[0]
scores = X @ w + b
margins = y * scores
viol = margins < 1.0
grad_w = w.copy()
grad_b = 0.0
if np.any(viol):
grad_w -= (C / n) * (X[viol].T @ y[viol])
grad_b = -(C / n) * float(np.sum(y[viol]))
return grad_w, grad_b
def train_linear_svm_subgradient_descent(
X: np.ndarray,
y: np.ndarray,
*,
C: float = 1.0,
lr: float = 0.2,
n_epochs: int = 200,
seed: int = 42,
) -> tuple[np.ndarray, float, LinearSVMHistory]:
"""Train a linear classifier with L2 + hinge using full-batch subgradient descent."""
rng_local = np.random.default_rng(seed)
w = rng_local.normal(scale=0.01, size=X.shape[1])
b = 0.0
hist = LinearSVMHistory(objective=[], mean_hinge=[], accuracy=[])
for _ in range(n_epochs):
obj, mean_hinge = linear_svm_objective(w, b, X, y, C=C)
scores = X @ w + b
y_pred = np.where(scores >= 0.0, 1.0, -1.0)
acc = float(np.mean(y_pred == y))
hist.objective.append(obj)
hist.mean_hinge.append(mean_hinge)
hist.accuracy.append(acc)
grad_w, grad_b = linear_svm_subgrad(w, b, X, y, C=C)
w = w - lr * grad_w
b = b - lr * grad_b
return w, b, hist
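As a quick sanity check on the subgradient (separate from training), we can compare it to central finite differences at a random point; away from the kink at $m_i = 1$ the hinge term is differentiable, so the two should agree closely. A small sketch using the helpers defined above (the random data here exists only for the check):

# Finite-difference check of grad_w (valid when no margin is exactly 1)
rng_check = np.random.default_rng(0)
X_check = rng_check.normal(size=(20, 2))
y_check = np.where(rng_check.random(20) < 0.5, -1.0, 1.0)
w_check = rng_check.normal(size=2)
b_check = 0.3
eps = 1e-6
grad_w_analytic, _ = linear_svm_subgrad(w_check, b_check, X_check, y_check, C=2.0)
grad_w_numeric = np.zeros_like(w_check)
for j in range(w_check.shape[0]):
    w_plus, w_minus = w_check.copy(), w_check.copy()
    w_plus[j] += eps
    w_minus[j] -= eps
    obj_plus, _ = linear_svm_objective(w_plus, b_check, X_check, y_check, C=2.0)
    obj_minus, _ = linear_svm_objective(w_minus, b_check, X_check, y_check, C=2.0)
    grad_w_numeric[j] = (obj_plus - obj_minus) / (2 * eps)
print("analytic:", grad_w_analytic)
print("numeric :", grad_w_numeric)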
# --- Make a simple dataset ---
X_raw, y01 = make_blobs(n_samples=250, centers=2, cluster_std=1.8, random_state=42)
y_pm1 = np.where(y01 == 0, -1.0, 1.0)
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)
w, b, hist = train_linear_svm_subgradient_descent(X, y_pm1, C=2.0, lr=0.15, n_epochs=220)
print("final objective:", hist.objective[-1])
print("final mean hinge:", hist.mean_hinge[-1])
print("final accuracy :", hist.accuracy[-1])
final objective: 0.5771675247777874
final mean hinge: 0.12760727393472507
final accuracy : 0.996
from plotly.subplots import make_subplots
epochs = np.arange(len(hist.objective))
fig = make_subplots(
rows=3,
cols=1,
shared_xaxes=True,
vertical_spacing=0.06,
subplot_titles=(
"Objective (0.5||w||^2 + C·mean_hinge)",
"Mean hinge loss",
"Accuracy",
),
)
fig.add_trace(go.Scatter(x=epochs, y=hist.objective, name="objective"), row=1, col=1)
fig.add_trace(go.Scatter(x=epochs, y=hist.mean_hinge, name="mean hinge"), row=2, col=1)
fig.add_trace(go.Scatter(x=epochs, y=hist.accuracy, name="accuracy"), row=3, col=1)
fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_yaxes(title_text="value", row=2, col=1)
fig.update_yaxes(title_text="", row=3, col=1, range=[0, 1.02])
fig.update_xaxes(title_text="epoch", row=3, col=1)
fig.update_layout(height=700, title="Training curves (full-batch subgradient descent)")
fig.show()
# Visualize decision boundary + margin band in 2D
x1_min, x1_max = float(X[:, 0].min() - 1.0), float(X[:, 0].max() + 1.0)
x2_min, x2_max = float(X[:, 1].min() - 1.0), float(X[:, 1].max() + 1.0)
xs = np.linspace(x1_min, x1_max, 200)
w0, w1 = float(w[0]), float(w[1])
def boundary_line(level: float) -> tuple[np.ndarray, np.ndarray]:
"""Return points (x1, x2) satisfying w0*x1 + w1*x2 + b = level."""
if abs(w1) > 1e-10:
x1 = xs
x2 = (level - b - w0 * x1) / w1
return x1, x2
# Vertical line fallback
x1 = np.full_like(xs, (level - b) / w0)
x2 = np.linspace(x2_min, x2_max, xs.shape[0])
return x1, x2
margins = y_pm1 * (X @ w + b)
support = margins <= 1.0 + 1e-12
fig = go.Figure()
# points by class
for cls, color in [(-1.0, "#1f77b4"), (1.0, "#d62728")]:
mask = y_pm1 == cls
fig.add_trace(
go.Scatter(
x=X[mask, 0],
y=X[mask, 1],
mode="markers",
name=f"y={int(cls)}",
marker=dict(size=8, color=color, line=dict(width=0)),
)
)
# highlight support vectors
fig.add_trace(
go.Scatter(
x=X[support, 0],
y=X[support, 1],
mode="markers",
name="support (m ≤ 1)",
marker=dict(size=14, color="rgba(0,0,0,0)", line=dict(width=2, color="black")),
)
)
# decision boundary and margins
for level, name, dash, width, color in [
(0.0, "decision f(x)=0", "solid", 3, "black"),
(1.0, "+margin f(x)=+1", "dash", 2, "gray"),
(-1.0, "-margin f(x)=-1", "dash", 2, "gray"),
]:
x1, x2 = boundary_line(level)
fig.add_trace(
go.Scatter(
x=x1,
y=x2,
mode="lines",
name=name,
line=dict(dash=dash, width=width, color=color),
)
)
fig.update_layout(
title="Learned linear classifier with hinge loss (support vectors highlighted)",
xaxis_title="x1 (scaled)",
yaxis_title="x2 (scaled)",
)
fig.update_xaxes(range=[x1_min, x1_max])
fig.update_yaxes(range=[x2_min, x2_max])
fig.show()
## 5) The role of C (regularization trade-off)
In the objective
$$
\tfrac12\lVert w\rVert^2 + C\,\text{mean hinge},
$$
- **small `C`**: regularization dominates → wider margin, more tolerance for violations
- **large `C`**: hinge penalties dominate → tries harder to fit training points (narrower margin)
Below we train three models with different `C` values and compare the resulting decision boundaries.
from plotly.subplots import make_subplots
Cs = [0.2, 2.0, 20.0]
models: list[tuple[float, np.ndarray, float]] = []
for C in Cs:
w_c, b_c, _ = train_linear_svm_subgradient_descent(
X, y_pm1, C=C, lr=0.15, n_epochs=220, seed=42
)
models.append((C, w_c, b_c))
fig = make_subplots(rows=1, cols=len(Cs), subplot_titles=[f"C={C}" for C in Cs])
for col, (C, w_c, b_c) in enumerate(models, start=1):
# data
for cls, color in [(-1.0, "#1f77b4"), (1.0, "#d62728")]:
mask = y_pm1 == cls
fig.add_trace(
go.Scatter(
x=X[mask, 0],
y=X[mask, 1],
mode="markers",
marker=dict(size=6, color=color),
showlegend=(col == 1),
name=f"y={int(cls)}",
),
row=1,
col=col,
)
# boundary (only f(x)=0 to keep it readable)
w0, w1 = float(w_c[0]), float(w_c[1])
if abs(w1) > 1e-10:
x1 = xs
x2 = (0.0 - b_c - w0 * x1) / w1
else:
x1 = np.full_like(xs, (0.0 - b_c) / w0)
x2 = np.linspace(x2_min, x2_max, xs.shape[0])
fig.add_trace(
go.Scatter(
x=x1,
y=x2,
mode="lines",
line=dict(width=3, color="black"),
showlegend=False,
name="boundary",
),
row=1,
col=col,
)
fig.update_layout(
height=420,
title="Effect of C on the learned decision boundary",
)
for col in range(1, len(Cs) + 1):
fig.update_xaxes(title_text="x1", range=[x1_min, x1_max], row=1, col=col)
fig.update_yaxes(title_text="x2", range=[x2_min, x2_max], row=1, col=col)
fig.show()
## 6) Practical usage: `sklearn.metrics.hinge_loss`

`sklearn.metrics.hinge_loss(y_true, pred_decision, ...)` expects:

- binary: `pred_decision.shape == (n_samples,)` (a real-valued decision score)
- multiclass: `pred_decision.shape == (n_samples, n_classes)` (one score per class)

A common workflow:

1. train a classifier that exposes `decision_function`
2. compute `pred_decision = model.decision_function(X)`
3. evaluate with `hinge_loss(y_true, pred_decision)`

Below we fit `LinearSVC` and compare sklearn's hinge loss to our NumPy implementation.
clf = LinearSVC(C=2.0, dual=True, random_state=42)
clf.fit(X, y01)
dec = clf.decision_function(X)
skl = skl_hinge_loss(y01, dec)
ours = binary_hinge_loss(np.where(y01 == 0, -1.0, 1.0), dec)
print("sklearn hinge_loss:", skl)
print("numpy hinge_loss:", ours)
sklearn hinge_loss: 0.012641047919911632
numpy hinge_loss: 0.012641047919911632
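The multiclass case works the same way: `decision_function` returns one score per class. A minimal sketch on a made-up 3-class dataset (here `multi_class="crammer_singer"` is chosen so the scores correspond to the Crammer–Singer formulation; the one-vs-rest default also produces scores that `hinge_loss` accepts):

X3_raw, y3 = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=0)
X3 = StandardScaler().fit_transform(X3_raw)
clf3 = LinearSVC(C=1.0, multi_class="crammer_singer", random_state=0)
clf3.fit(X3, y3)
dec3 = clf3.decision_function(X3)   # shape (n_samples, n_classes)
print("decision_function shape:", dec3.shape)
print("sklearn hinge_loss:", skl_hinge_loss(y3, dec3))
print("numpy hinge_loss  :", multiclass_hinge_loss(y3, dec3))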
## 7) Pros, cons, and when to use hinge loss
### Pros
- **Convex** (for linear models): optimization is well-behaved (no local minima).
- **Margin-aware**: doesn’t just separate classes; encourages a safety buffer.
- **Sparse dependence on data** (SVM view): only points with $m \le 1$ influence the solution.
- Often strong performance for **high-dimensional** classification (e.g., text with bag-of-words / TF-IDF).
### Cons
- **Non-smooth** at $m=1$ (requires subgradients or a smoothed variant).
- Produces **uncalibrated scores** (unlike logistic loss, it’s not a log-likelihood).
- Not ideal when you need **probabilities** or well-calibrated uncertainty.
- Can be sensitive to **label noise** near the boundary (like most margin-based methods).
### Good use cases
- Binary or multiclass classification when you care about **large margins**.
- Linear classification on large, sparse feature spaces (classic SVM territory); a small sketch follows below.
- As a surrogate for the 0–1 loss when you need a convex objective.
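To illustrate the sparse, high-dimensional use case, here is a tiny sketch on a made-up toy corpus (the documents, labels, and variable names are invented purely for illustration; a real problem would use a proper corpus and a held-out split):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cheap pills buy now limited offer",
    "win money click this link today",
    "exclusive deal act now free prize",
    "meeting rescheduled to thursday afternoon",
    "please review the attached quarterly report",
    "notes from yesterday's project standup",
]
labels_text = np.array([1, 1, 1, 0, 0, 0])   # toy labels: 1 = spam-like, 0 = normal

vec = TfidfVectorizer()
X_text = vec.fit_transform(docs)             # sparse TF-IDF features
svm_text = LinearSVC(C=1.0).fit(X_text, labels_text)
dec_text = svm_text.decision_function(X_text)
print("train hinge loss:", skl_hinge_loss(labels_text, dec_text))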
## 8) Common pitfalls and diagnostics

- **Use decision scores**: hinge loss needs raw scores (e.g., `decision_function`), not probabilities (see the sketch after this list).
- **Label encoding**: the math is cleanest with $y \in \{-1, +1\}$; many libraries accept `{0, 1}`, but be explicit.
- **Feature scaling**: for linear models with L2 regularization, scaling can strongly affect the margin and the effective regularization.
- **Class imbalance**: hinge loss itself doesn't fix imbalance; consider class weights or re-sampling.
- **Interpretation**: a lower hinge loss generally means larger margins, but it's not a calibrated probability of correctness.
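To make the first pitfall concrete, here is a hedged sketch contrasting decision scores with probabilities (using `LogisticRegression` only because it conveniently exposes both `decision_function` and `predict_proba`; hinge loss is meaningful only for the former):

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression().fit(X, y01)     # reuse the scaled 2D dataset from earlier
dec_scores = logreg.decision_function(X)      # real-valued margins
probs = logreg.predict_proba(X)[:, 1]         # values squeezed into [0, 1]
print("hinge on decision scores:", skl_hinge_loss(y01, dec_scores))
# Probabilities run without error, but the result is not a meaningful hinge loss,
# because probabilities compress all margins into [0, 1]:
print("hinge on probabilities  :", skl_hinge_loss(y01, probs))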
## Exercises

1. Implement squared hinge loss and compare optimization behavior (smoother gradients).
2. Add L1 regularization and see how it changes the sparsity of `w`.
3. Compare hinge vs logistic loss on the same dataset: decision boundary, calibration, and outliers.
4. Implement SGD (mini-batches) for the hinge objective and compare convergence.
## References

- scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html
- Vapnik, V. *The Nature of Statistical Learning Theory*.
- Cortes, C. & Vapnik, V. (1995). Support-vector networks. *Machine Learning*.